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1 Summary 

There are many proposed vision based methods to perform obstacle detection and avoid- 
ance for autonomous or semi-autonomous vehicles. All methods, however, will require very 
high processing rates to achieve real time performance. A system capable of supporting 
autonomous helicopter navigation will need to extract obstacle information from imagery 
at rates varying from ten frames per second to thirty or more frames per second depending 
on the vehicle speed. Such a system will need to sustain billions of operations per second. 
To reach such high processing rates using current technology, a parallel implementation of 
the obstacle detection/ranging method is required. This paper describes an efficient and 
flexible parallel implementation of a multisensor feature- based range-estimation algorithm, 
targeted for helicopter flight, realized on both a distributed-memory and shared-memory 
parallel computer. 

2 Introduction 

The design of intelligent low-altitude guidance systems for helicopters requires information 
about objects in the vicinity of the flightpath of the vehicle. The sensor system on the 
helicopter must be able to detect objects such as buildings, trees, poles and wires during 
flight. A complete obstacle-detection system may consist of an active ranging sensor and 
passive ranging using electro-optical sensors. A comprehensive overview of this problem can 
be found in references [1, 2, 3). 

Several techniques have been proposed for range determination using electro-optical 
sensors [4, 5, 6]. These techniques use optical flow resulting from the relative motion between 
the sensor and objects on the ground together with the helicopter state from an inertial 
navigation system to compute range to various objects in a scene. One algorithm of interest 
can detect, track, and estimate range to image features (i.e., patches of an image with 
common statistics or spatial structure) over time from a multisensor system mounted on a 
vehicle moving with arbitrary six degrees of freedom [7]. 

The estimation of range using electro-optical sensors involves large volumes of data, for 
example, 15 MB/sec for 8-bit grey scale stereo images at 30 frames per second, and requires 
processing power in the range of a few billion operations per second. Today, there is no 
single off-the-shelf microprocessor or digital signal processor which can meet this demand. 
The large amount of computation required to solve problems in computer vision is well- 
recognized, and parallel processing presents an approach to achieve the speed necessary for 
real-time implementation [8]. However, parallel processing does not provide a linear increase 
in speed and the actual increase depends on the computer architecture and the application 
[9]. The selection of a parallel processing architecture for the range-estimation problem has 
to examine the trade-offs between several architectures in terms of their effect on overall 
speed increase, processor utilization, programmability, and physical constraints. A promis- 
ing system must be adaptable to changes in the vision algorithm, exhibit good scalability, 
and must be installable, at some point, on board a helicopter. The constraints of high speed, 
algorithm flexibility, and system scalability favor a general-purpose parallel RISC-based sys- 
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Figure 1: Obstacle geometry. 


tem over traditional pipelined image processors. This does not prevent a viable system from 
using a traditional image processor as a “front end” to the parallel computer. Simply due 
to the algorithm’s complexity, a traditional “image processing” approach will be lacking in 
flexibility [10]. 

This paper presents a multiple-sensor range-estimation algorithm along with a discussion 
of an efficient and flexible method of parallelization which is necessary to realize real-time 
operation on a distributed-memory or shared-memory parallel computer. The paper is orga- 
nized as follows: Section 2 gives a quick overview of the mathematical background of optical 
flow. Section 3 describes the extended- Kalman-filter- based range-estimation algorithm, ex- 
tension of the procedure to multiple sensors, initialization procedure and an introduction 
to the multirate Kalman filter [11]. Section 4 discusses the feature tracking algorithm and 
section 5 describes virtual processing regions, a software abstraction used for parallelization. 
Section 6 describes three load balancing schemes based on virtual processing regions. Section 
7 presents some initial results using a distributed-memory parallel computer composed of a 
network of workstations and a modern shared-memory multiprocessor. Section 8 completes 
the paper with some concluding remarks and a discussion of future work. 

The authors would like to thank Silicon Graphics Inc. for access to a IRIS 4D/480 for 
timing measurements. 

3 Optical Flow 

Consider a rotorcr aft- mounted sensor rotated with respect to the body axis by the orthonor- 
mal rotation matrix Tj,, and offset from the vehicle’s body axis by /(,. The sensor, in motion 
with respect to an inertial Earth-fixed world axis system, observes an obstacle O whose 
location is fixed in the Earth frame as shown in Fig. 1. We wish to determine the relative 
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position of the obstacle O with respect to the sensor 


p = r 0 - r a (1) 

The imaging sensor maps the world object O whose location in sensor axes is p s — [pax, Pay, Psz] 
onto the image plane at Q by perspective projection according to the equation 


U _ T f pay / Pax 
.V . ifPaz/Psx 


(2) 


where / is the focal length of the sensor. As the sensor moves, p a changes and so the image 
location corresponding to O changes as follows: 


U 1 f pay Upax 

V . pax - f P sz V pax J 


( 3 ) 


The sensor frame is moving so the derivative of p, is determined using the Coriolis equation 


f>w = Pa + W X p 


( 4 ) 


where p w is the derivative of p in world axes and is equal to the negative of the sensor velocity 
in world axes, p a is the derivative of p in sensor axes, and u; is the rotation of t,he sensor axes 
relative to the world axes. Let V a = \Vax, V^ y , Vaz^ an d be the linear 

and angular velocity of the sensor with respect to the world frame and resolved in the sensor 
axes. Then noting that 0 is fixed in world axes and using equations (1) and (4), we obtain 
the relation 

Pa = -V, ~ U>, X Pa ( 5 ) 

The motion of the image point corresponding to O can now be written in terms of the sensor 
motion using equations (3) and (5), giving the result 


u = uj -f tin 
v = vt ■+■ vr 

ut = (— fV 9y + uV 3X ) / p 3X 

VT = ( — fVaz 4 vVax) / Psx (^) 

UR = fUay~f( 1 + £) Uaz + Max 
VR = f (l + yi) — ^yUJaz — UU>ax 


The motion of the image point corresponding to O due to sensor motion is known as optical 
flow. Here the optical flow has been decomposed into components due to translational and 
rotational motion of the sensor, denoted by the subscripts T and /£, respectively. and 
u) t can be derived from the rotorcraft’s inertial navigation system. With knowledge of the 
sensor motion, the focal length, and the optical flow [u, u] obtained from a feature tracking 
algorithm, the range, p 3 x > of an object O at the image location [u, u] can be determined from 
the optical flow equations (6). The full vector p, can then be recovered with the perspective 
projection equation (2). Knowledge of the dynamics of a sensor in an Earth-fixed inertial 
frame is an essential element in this range-estimation algorithm. 
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4 Kalman Filter 


There have been many published methods of extracting range from motion imagery [12, 
13, 14, 15, 16]. A simple method would be to measure the optical flow between every 
consecutive frame, and using equations (2) and (6), as described earlier, extract the object 
location p a . The trouble with such a method is the unreliability and low signal-to- noise 
ratio of a single measurement of [u,u], due mainly to pixel quantization and inaccuracies 
of subpixel localization. To greatly improve the range estimate of an object, an extended 
Kalman filter (EKF) is used to recursively estimate p , given multiple measurements of [u, t>]. 
Several Kalman filter implementations were studied by Sridhar and Phatak [4], who 
obtained the best results by selecting the state vector X = p a and the measurement vector 
Z = [u, u], With these definitions, equation (5) becomes the state equation and the perspec- 
tive projection equation (2) become the measurement equation. The state and measurement 
equations can be written as follows 

X = -[u.]X-V. 

Z = h(X)=[fp„/p.„fp.Jp„] T (7) 

where 

0 ~U) 3Z U,y 

(w,] = u az 0 -u ax (8) 

~U S y U), X 0 

The state equation is a time varying linear system that depends on the camera’s translational 
and rotational velocities. The measurement equation is a nonlinear function of the state. 

The continuous time state and measurement equations can be converted to their discrete 
time equivalents assuming that V s and u> a are constant during the sampling interval AT. The 
discrete time system equations are 

X(k + l ) = $(k)X(k) + T(k)U(k) + T d (k)( x (k) (9) 

Z(k) = *[*(*)] + C,(*) (10) 

where $(&) is the state transition matrix, T(k) is the input distribution matrix, U(k) = 
— V a (k) is the control vector, T d (k) is the disturbance distribution matrix, and Cx(^) an< i 
£ z (fc) model the process noise and measurement noise, respectively. Zero mean gaussian 
white noise is assumed such that R(k) = cov(( z (k)) and Q(k) = cou(C x (A:)). The state 
transition matrix and the control distribution matrices derived by Sridhar and Phatak can 
be found in [4]. The measurement equation is linearized about the current estimate of X 
giving 

Z(Jfc) = H(k)X(k) (11) 

= «(*)/«-/[:£'£ llp 0 ~ (i2) 
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Figure 2: Two-sensor initialization. 


The discrete time state equation (9) and the linearized measurement equation (11) can be 
used in a standard Kalman filter to recursively estimate the state vector X and the state 
covariance matrix P. 

The Kalman filter consists of two parts: the measurement update which improves the 
state estimate given a new measurement, and the time update which propagates the state 
forward in time according the system dynamics. Before each iteration of the Kalman filter, we 
have estimates of X(k), P(k ), Q{k ), and The measurement update is then performed 

according to the following equations: 

X(k) = X(k) + K(k)[Z(k)-h(X{k))] 

P(k) = [I - K(k)H(k)]P(k) (13) 

where i/(fc) is computed from X(k) as described above and the Kalman filter gain K[k ) is 
computed using the equation 

K(k) = P(k)H(k) T [H(k)P(k)H(k) T + Rik)}- 1 (14) 

The time update equations are 

X(A: + 1) = ${k)X{k) + T{k)U{k) 

P(k+1) = ${k)P{k)$(k) T + T d (k)Q(k)r d {k) T (15) 

4.1 Initialization 

As noted above, the Kalman filter requires initial estimates for X and P. The initial estimate 
for X can be derived either of two ways. The first method is based on measurements of a 
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feature’s location within two images A T seconds apart from a single moving camera. The 
second method, applicable in multisensor configurations, uses measurements from two sensors 
displaced in space only. The initial estimate of the state covariance matrix P is chosen a 
priori. 

The single sensor initialization method uses the optic flow equations, the perspective 
projection equations, and the camera’s translational and rotational velocities which are as- 
sumed constant during the interval AT between images. First, the optic flow equations (6) 
are solved for x,, where [u,v] is the feature location in the sensor image plane. The optic 
flow equations actually comprise an overdetermined system of two equations in the one un- 
known i,, so a single quadratic equation in x a is formed by summing the squares of the two 
optic flow equations. Once i, is found, y, and z, can be determined from the perspective 
projection equations (2). 

A two-sensor stereo analysis is used to generate the initial estimate for a multisensor 
image sequence. In Fig. 2 two sensors are shown from an n-sensor configuration. One of 
the n sensors is designated as the master sensor and any of the others may be chosen as the 
slave sensor. An object O in the field of view (FOV) of the master and slave sensors will 
be imaged by both sensors. If occlusion effects are ignored then the image plane locations 
of O in the master and slave sensors are [ um,vm] and [us,i>s] defined by the perspective 
projection equation (2) and vectors pM and ps, respectively. The slave sensor is located at a 
position d with respect to the master sensor, and the transformation from the master sensor 
to the slave sensor is given by an orthonormal rotation matrix Tm,s ■ The master and slave 
sensors and the object O are related by the following equation: 

Ps = Tm,s(pm — d\f ) ( 16 ) 

Equation ( 16 ) can be expressed term by term 

PSx til *12 *13 PMx — d\tx 

pSy = *21 *22 *23 PMy — d-My ( 17 ) 

PSz J [ *31 *32 *33 J L PMz ~ d Mz 

If we assume pM x ~ p$ x = Ac, (i.e., 0 is approximately the same distance from the center 

of each sensor’s axis system) then equation ( 17 ) gives rise to two equations which may be 
solved for p x . Equation ( 17 ) suggests that the master and slave sensors may be arbitrarily 
placed. It is numerically desirable though to place the master and slave sensors such that 
the roll and pitch between the sensors axe minimized and d\f has a major component along 
either yM or zm- This is usually the case in standard two-sensor stereo setups. Equation 
(17) simply gives the full relations when two sensors do not have their scanlines registered. 
In our setup (as in most setups), the displacement d^ between the master and slave sensors 
will be dominated by the d\f y component; therefore, we have chosen to solve equation ( 17 ) 
for p x using the equation for ps y 


/M(*2ldMi + *22<*Af v + *23<*A/ Z ) 
*2l/M + t 22 UM + *23 PA/ — ^ff u S 


(18) 


Equations (18) and (2) can then be solved for pM ■ This will be used as the initial state for 
the EKF in the master sensor’s axis system. 
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4.2 Multirate EKF 

An image feature belonging to a far-away object or a feature near the FOE may have an 
interimage motion smaller than can be resolved by the measurement process [7, 17]. The 
effective signal-to-noise ratio of shift measurements can be increased by lengthening the time 
interval between images. The increased time interval can be effected by pausing one or more 
frames before a new measurement is made. This method essentially increases the motion 
baseline for the optical flow measurements. Each feature will have an optimal measurement 
delay m such that its measured shift between frame k — m and frame k is greater than some 

constant value 

dr < ^/(u(fc) - u(fc - m)) 2 + (v(fc) — v(k - m)) 2 (19) 

where [u(Jb), v(k)] is the measurement at time k. The choice of dr is based on the a priori 
estimate of the measurement noise ( z . The measurement noise is affected by the pixel 
quantization noise and the accuracy of the subpixel interpolation scheme used in subpixel 
correlation measurements, which is discussed in a following section. It should be noted that 
equation (19) is meaningful only if the sensor locations from which the two measurements 
have come have not rotated with respect to each other for at least mAT seconds. The reason 
for this, based on the optical flow equations (6), is that the rotational components of the 
flow [u/i, vr\ do not contain any information of an object’s range. It would be unwise to use 
this method, based on equation (19), to increase the signal-to-noise ratio of distant features 
without first removing the feature’s rotational component from the shift measurement. Since 
the helicopter will always be rotating somewhat over time, one of the candidate measurements 
needs to be rotated into the sensor frame of the other measurement before comparing them 
using equation (19). To up-rotate a measurement from the coordinate system at time k — m 
to the one at time k , the two equations for from (6) are used in their discrete form. 

These two equations can be written as a nonlinear vector function g{ m ) of the angular rates 
and image location [u,t>] of the feature at time i 



Equation (20) is used to calculate the image velocity induced by rotation of the sensor axis. 
In order to up-rotate a feature location [u(i — l),t;(i — 1)] to time t, the rotational velocity 
g(i - 1) is multiplied by the sampling time AT and is added to the image plane location of 
the feature at time i — 1. This will give the up-rotated image plane location at time i. This 
process is performed m times to find the location of [tt(fc m),v(k m)] at time k. The 
following is the iterative equation to up-rotate features 


■«(*)] _ [ u(i — 1)' 
,t>(i) J v(i — 1) . 


4 -g(i - 1)A T, 


i = k — m + 1, • • • , fc — 1 


( 21 ) 


The Kalman filter time updating is still performed at a constant rate which is equal to or 
faster than the smallest measurement delay possible. If a feature has an optimal measure- 
ment delay m, then every mth image a new measurement update is performed according to 
equations (13) and (14); otherwise, a trivial measurement update is performed according to 
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Figure 3: Multisensor geometry. 


the following equations: 

X(k) = X(k) 

p(k) = P(k) 

It should be noted that as a distant object approaches the sensor array its interimage shift 
will increase. Therefore, the measurement delay m for a feature will decrease over time until 
m = 1, indicating that a measurement is made during each frame. 


5 Feature Tracking Algorithm 

The most difficult aspect of feature-based passive range estimation is the accurate mea- 
surement of the optical flow (interimage shift) for each feature. Results of earlier research 
indicate that at least two passive sensors should be used for range estimation to eliminate 
problems associated with subpixel motion near the FOE [18]. The combination of stereo 
and motion processing has been found to produce a more robust range map than motion 
or stereo alone. The geometry for multisensor feature tracking is illustrated in Fig. 3. In 
the figure am object O is imaged by n sensors (<Si, . . . ,<S„). If a feature belonging to 0 is 
visually consistent among the sensors, then a measurement may be made of the feature’s 
location in each sensor’s image plane. Each measurement can be treated as an independent 
measurement of a feature’s optic flow as seen by different sensors. 
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Figure 4: Multisensor feature tracking. 


An extended Kalman filter is used to estimate the range for a feature as well as to 
fuse measurements from several sensors. Measurement fusion is effected by linearizing the 
measurement update equation of the filter for the appropriate sensor. The Kalman filter 
time update is used to generate a prediction of a feature’s location in one of the sensor axes 
at the next image sampling time. Time update, coupled with perspective projection, gives a 
prediction of a feature’s location in each sensor’s image plane at the next sampling time [7]. 

The feature tracking algorithm is currently limited to tracking features in imagery where 
the sensor geometry has been fixed. This limitation is only a practical matter, not a theo- 
retical restriction. A fixed sensor geometry means that during the tracking of features the 
relation (position and orientation) of each sensor to all the other sensors and the vehicle is 
fixed, thus disallowing pan/tilt sensor mounts. 


5.1 Autonomous Tracking Units 

Low-level feature tracking is parallel in nature. Each feature, once it is detected, is assigned 
an autonomous tracking unit (ATU). The ATU can be implemented in software as a separate 
process or thread. The tracking unit can track a feature in a single sensor over time. The 
tracking unit can also obtain multiple measurements by using more sensors. Fig. 4 depicts 
the strategy used to track a single feature in a two-sensor stereo image sequence. A feature 
is initially detected at image location [un,vu] in the master sensor. The ATU tracks the 
feature over time within the master sensor, while a match to a slave sensor is used to add 

1 Since each sensor will have different dynamics due to helicopter motion, the Kalman filter will need to 
be linearized for each sensor during measurement update. 
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stereoscopic information to the range estimation process. The range is initialized using a 
pure stereo measurement. The tracking strategy of Fig. 4 can accommodate any number of 
like sensors. 

5.1.1 Feature detection 

The feature tracking mechanism begins with the process of feature selection by partitioning 
the master image using a cell grid. Each cell is a square pixel area with an odd number 
of pixels, 2n + 1, to a side. Each grid cell is scanned to see if an image feature is present. 
Features can therefore be detected only with the spatial accuracy (within the image plane) 
of the grid resolution. Given an N x N image the grid would have N/(2n + 1) x N/(2n + 1) 
cells. The cell grid greatly speeds up feature extraction because only 7V 2 /(4n 2 -f 4n + 1) 
locations need to be searched instead of N 2 locations which many traditional token-based 
matching schemes require [19]. A cell size of 11 x 11 pixels gives good overall performance, 
balancing matching accuracy (discussed later) versus spatial resolution. A 512 x 512 pixel 
image would therefore be divided into 46 x 46 cells with 6 pixels remaining along two of the 
edges. 

For this implementation, feature selection is based on intensity variance within a grid 
cell. The following equation shows the intensity variance calculation at a grid cell centered 
at [u,v] of size (2n + 1) x (2n + 1) pixels (n = 5 for a 11 x 11 cell size, and let N=(2n+1)): 

Hl{u,v) = 4^12 12 I{u + i,v + j) (22) 

t= — n n 

= JJTZTi 12 12 (/(u + i.u + j) -/*/(“, *>» 2 (23) 

1 i= — n )=—n 

If <7j is greater than a constant threshold value <r^, then the image location [u,t>] is said to 
be a feature. 

5.1.2 Search window 

Once a feature has been detected in a grid cell, a correspondence is generated between it and 
an identically sized pixel area in another image. This provides measurement of the feature’s 
optical flow. The target image may be taken from another sensor at the same time (stereo) or 
from the same sensor at a different time (motion) or a combination of both (motion/stereo). 
Search windows are generated by projecting the estimated three-dimensional feature location 
onto the target sensor image plane. 

If a feature is new (i.e., has no range estimate) then a worst case guess is made based 
on a priori near and far clipping planes of a range volume in front of the helicopter. Fig. 
5 illustrates the search window generation procedure for a newly detected feature in the 
motion/stereo approach. An object O gives rise to a feature in sensor <Sj (the master sensor) 
where p,i intersects the image plane. Only the feature’s basis vector e Pjl is known, since only 
a single sensor can be used to detect new features. The minimum, nominal, and maximum 
lengths for p, i can be computed using e Ptl and the near and far clipping planes, Rmi n and 
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Rmax • The nominal value of p 3 1 is calculated using R„ om = ( Rmax + #m«n)/2. The three 
vectors p 3 \ min , p,i nom , and p 3 \ mai are generated using the following equation with R equal to 
appropriate clipping value: 


Pslx 


’ R ' 

Paly 

= 


Palz 

R 



The three vectors p a lmjB , p,i nom and p,\ max can then be transformed from the master sensor 
axis at time ( k — 1) to the appropriate target sensor at time ( k ) (<S„ in Fig. 5). The 
three vectors, now resolved into the S n sensor axis at time (k), can be projected onto the 
appropriate target image plane using perspective projection. The result of projecting these 
three vectors is an image plane boundary where the feature should lie if the object 0 was 
within the minimum and maximum range clipping planes along the vector e Ptl . Fig. 5 also 
shows the search window generated by the image plane boundary. The location [ ^mm i ^min] 
corresponds to the projected vector p 3nmin ', likewise for [u nom , u nom ] and [u moi , v mai ]. The 
search window is a diamond shape with the width of the diamond equal to twice the distance 
from [u nom , u nom ] to [u moi , v max ). This shape approximates a three-dimensional error ellipsoid 
projected onto the image plane. 

Once a feature has an initial range estimate, the near and far clipping planes are derived 
from the state covariance matrix generated by the Kalman filter. Therefore, as the Kalman 
filter for a particular feature converges, the clipping planes used for search window generation 
collapse around the correct range, decreasing the size of the search window and reducing 
computation. A minimum search window size is enforced to prevent the search window from 
shrinking too much during convergence. 

5.1.3 Correlation 

The tracking unit uses the search window computed from a feature’s range estimate to find 
all the pixels in the target sensor image where the feature may be located. A correlation 
calculation is then made between the cell in the master sensor which bounds the feature and 
cell sized regions in the target image centered at each pixel within the search window. 

The correlation operates on the original pixel intensities rather than image-derived mea- 
surements to use as much of the actual image information as possible. The result of the 
correlation is a value in the interval [0,1], where 1 indicates a perfect correlation and 0 
an uncorrelated match. The algorithm can use any normalized two-dimensional correlation 
method. We have achieved good performance using normalized correlation which is presented 
below: 2 


(2n + 1 Ypab - 

\J{2n + l) 2 ^ — ~p^yj{2n + 1) 2 <xb — /*b 


(25) 


2 It should be noted that standard normalized correlation produces a correlation value on the interval 
[— 1, 1]. Our implementation maps all negative correlations onto zero. 
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Search Window 


Figure 5: Search window generation. 


where 


/M = [£ £/*(«-*,< >~j)] ( 26 ) 

t=— n j=—n 

l*B = [Y1 13 ~ *’ 9 ” ■*')] 

t=— n j=—n 

n n 

Hab = \YI L 

»=— n j=— n 

»*={£ t < 27 > 

i=— n j=— n 

<tb = 1X3 13 ^S(p-*»9-i)] 

t=— n j=-n 


The feature is detected at [u, u] in image A and is correlated with image B at location 
[p, <7] as given by the search window. The result of all the correlations is a correlation surface 
the size of the search window. The image plane location of the peak value of the surface is 
the “best match” between the cell in the master image and search area in the target image. 
Quadratic interpolation of the correlation function separately along u and v is used to refine 
the estimate of the peak to subpixel accuracy. When the next frame is acquired the nearest 
pixel to the subpixel location is chosen as the location of the feature. The error between 
the subpixel feature location and the closest integer pixel is then added into the subpixel 
interpolation for the next frame. Thus features are correlated on integer pixel boundaries 
but tracking is maintained at subpixel accuracy over multiple frames. 


6 Virtual Processing Regions 

The autonomous tracking units described in the previous section are task-parallel in nature. 
Once a feature is detected within the cell grid, an ATU is spawned to track the feature. If 
a feature leaves the image plane or otherwise becomes untrackable then the ATU dies. As 
motion imagery evolves, ATUs will track the optical flow within the image. Thus an ATU 
will generally flow from the center of the image toward an edge (assuming forward motion). 
If each ATU spawned by a particular grid cell is assigned to a processor then the data 
requirements (image data) for that processor must be the union of the data requirements 
of each ATU spawned by that cell. This is not a problem when the features are young and 
close to the originating grid cell. Over time, though, the ATUs will spread out from the 
originating cell and could potentially cover much of the image. Therefore, the autonomous 
nature of ATUs will lead to data requirements which will evolve to be nonlocal within the 

image space. 

Nonlocal data requirements for each task can lead to poor performance in a multipro- 
cessor system. This is due mainly to the communication overhead of either sending large 
portions of the data space to each processor in a distributed-memory system, or memory 
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Figure 6: Image plane partitioning. 

network contention in a shared-memory system [20]. The implication of this is that task par- 
allelism alone is not sufficient to efficiently map the algorithm onto simple parallel hardware. 
To overcome the data locality requirements, a higher level abstraction is introduced above 
the level of the ATU. This abstraction, the virtual processor region (VPR), adds spatial 
locality restrictions to each ATU within the image space. 

Fig. 6 illustrates the idea behind virtual processor regions. The textured squares rep- 
resent the location of ATUs within a master sensor image plane. The ATUs are arranged to 
simulate the tracking of two trees and several ground features. The image is divided into 8x8 
VPRs (heavy lines). Each VPR is responsible for maintaining a rectangular arrangement of 
grid cells. In this example each VPR is allocated 5x5 grid cells (thin lines). The boundaries 
for the VPRs are the same as for the underlying cell grid. The maximum number of VPRs 
is equal to the number of grid cells. 3 

The VPRs represent separate regions within the image plane that can be allocated to a 
processing element (PE). In the example of Fig. 6 there are 64 VPRs which can be distributed 
among up to 64 processing elements in a task/data parallel fashion. Each PE processes 
the ATUs (textured squares) and performs feature detection in untracked grid cells (white 
squares) which are contained within its assigned VPR. Since the VPRs are spatially allocated 
their image data requirements are fixed. Therefore, as an ATU tracks a feature across a VPR 
boundary, the VPR passes the ATU to the appropriate neighboring VPR before the next 
image set is acquired. Currently each VPR is given enough image pixels 4 surrounding the 

3 In the case of 512 x 512 pixel images with 11x11 pixel cells there can be as few as one VPR or as many 
as 2116 VPRs. 

4 Each VPR is currently given a ten pixel wide image strip from each of its neighbors. The size of this 
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Figure 7: Feature tracking algorithm flowchart. 


VPR proper such that ATUs can be tracked into a neighboring VPR’s image space without 
the need for interprocessor communication. This is an interim solution allowing for more 
flexibility in the algorithm such that a wider range of architectures may be considered. If an 
architecture has very cheap nearest-neighbor communication then ATUs would be designed 
to be transferred between VPRs during the tracking phase of the algorithm. If the tracked 
features have inter-image shifts greater than can be handled by the extra pixels sent along 
with a VPR, then interprocessor communication during the tracking phase will be necessary. 

Fig. 7 depicts the feature tracking method as a flow chart using the definitions of au- 
tonomous tracking units and virtual processing regions. The feature tracking algorithm 
based on virtual processing regions exhibits Single Program Multiple Data (SPMD) paral- 
lelism. Each VPR can be processed in parallel as soon as its input data have been supplied. 
The VPR however can exploit further parallelism at the ATU level if it contains more than 
one grid cell. Each ATU/grid cell can be processed in parallel. The data requirements for 
each ATU are implicitly supplied by the parent VPR. The ATU in turn is composed of a 
series of serial matrix-like operations which exhibit vector-like parallelism. The number next 
to each ellipsoid indicates the aggregate percentage of total computation needed by each 
function. 

The motivation behind the ATU/VPR construct is flexibility. The feature tracker can 
be configured to use as few as one or as many as a couple thousand VPRs. Changing the 
number of VPRs obviously affects the ATU per VPR ratio. On a highly parallel machine 
(with several thousand processors) each processor would be assigned either an empty grid 

strip is based on the highest inter-image shift expected during tracking. 
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cell or an active ATU; there would be no need for VPRs. Since the algorithm has been 
designed to be ported to different architectures, the VPR construct is necessary to reduce 
the number of task/data parallel units to the optimal number for a given architecture and 
load balancing scheme. 

The ellipsoid labeled Reassignment in Fig. 7 indicates the section of the code which 
detects when an ATU crosses a VPR boundary. As stated previously, since this section 
could use interprocessor communication it is left as a serial section until an architecture port 
is made. Since reassignment is relatively cheap on a uniprocessor computer, its parallelization 
has not been considered in this paper. 

7 Load Balancing 

Each virtual processing region is task and data independent. The computational load rep- 
resented by each VPR is proportional to the number of ATUs being managed by that VPR. 
If the feature distribution in a scene is nonuniform then the number of ATUs per VPR may 
vary greatly over the set of VPRs. If this occurs then a load balancing technique would 
be needed to most effectively utilize every PE in a parallel system. Three load balancing 
techniques have been explored: uniform partitioning and static and dynamic scheduling [10]: 

7.1 Uniform partitioning 

If scene content is such that features are uniformly distributed over the image plane, then 
an allocation scheme which creates equal sized partitions would be optimal. Given an N 
processor machine the image plane would be divided into N equal sized VPRs; one VPR 
per PE. No explicit load balancing would be necessary to equally utilize each PE due to the 
uniform distribution of features in the image plane. 

7.2 Static scheduling 

If scene content is such that features are not distributed uniformly (which is most often the 
case), then a load balancing technique will be needed to make efficient use of every PE in a 
system. The major computational load of each VPR is performing the correlations necessary 
for feature tracking. If the time to scan a cell for a new feature is tj, the time to generate a 
correlation surface is t c and the time to perform measurement and time update is </; then if 
the tth VPR has A,- untracked cells and B x ATUs (tracked features), the computation time 
for the tth VPR, r,, can be approximated by 

tv fts Aitd + Bi(t c + tj + tj) (28) 

If the number of features per VPR does not change too rapidly during steady-state feature 
tracking, then it would be possible to perform static scheduling for the current frame time 
based on each VPR’s estimated computation time r, from the previous frame. 

Given that the master image is divided into M VPRs, static scheduling attempts to 
distribute all M VPRs from the current frame onto a set of N processors so as to minimize 
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completion time. It is required that M > N so that the scheduler has the resolution to 
properly distribute the load. More precisely: Given N processors, a deadline D and an M 
element set, X , of VPRs each with estimated computation time r,-, select a disjoint partition 
of X = Xi U X 2 U • • ■ U X N such that 

max | yi T » : i — j — 

The estimated time necessary to process all VPRs with a uniprocessor is 

T = f>, (30) 

t=l 

Thus the best case deadline D possible, given N processors, is T/N. This is known as 
the Multiprocessor Scheduling Problem and has been shown to be NP-complete [21]. The 
challenge of static scheduling is to choose the partitions Xj in a computationally efficient 
manner such that the maximum PE computation time approaches T /N . 


7.3 Dynamic scheduling 

If scene content changes rapidly such that the number of tracked features per VPR fluctuates 
from one frame to the next or the number of correlations necessary to generate a correlation 
surface fluctuates from one feature to the next, then equation (28) will not be an accurate 
representation of the computation time required by a VPR. To correct for such occurrences 
a higher order model of the computation time may be formulated by taking the influencing 
dynamic factors into account. This approach, though, may lead to an overly complicated 
computational model or a model which has dominant factors which cannot be predicted 
efficiently. In such cases a dynamic approach may be used [10]. A simple method is to 
have a controlling processor distribute VPRs to slave processors from a task queue of VPRs. 
Processing elements are assigned new VPRs as they finish processing their current VPR. 
This method of dynamic VPR allocation is practical only if the communication network 
between the controlling task scheduler and the PEs does not saturate with the necessary 
communication overhead. 

None of the scheduling schemes presented above explicitly takes into account the com- 
munication time necessary to download data to each PE. Uniform partitioning will be ad- 
versely affected if communication time becomes comparable to computation time. For static 
scheduling the communication time can be factored into the scheduler, if the delays are de- 
terministic, by introducing it as another element in the cost function. Dynamic scheduling 
has the benefit of inherently adapting to nondeterministic communication delays and time 
varying unbalanced architectures [10]. 
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Figure 8: Image with corresponding intensity coded range map. 

8 Implementation Results 

The parallel constructs and load balancing methods described earlier allow implementation 
of the feature tracking algorithm on distributed-memory computers or multithreaded shared- 
memory computers or a combination of both. An ideal evaluation consists of comparing each 
architecture/load balancing combination and choosing the scheme which performs best. Due 
to limitations in available hardware and software, only a subset of the schemes has currently 
been implemented and compared. We present several schemes which represent trends in 
current parallel computer systems design which have a strong effect on the performance of 
the algorithm. The execution time and speedup results of the various implementations will 
suggest further architectures and software design issues for investigation. 

The following subsections describe a distributed-memory machine based on a network 
of workstation-class computers and a modern shared-memory multiprocessor. The feature 
tracking algorithm was ported to each architecture and, if the operating system software 
allowed, each load balancing scheme was examined. 

Fig. 8 shows the 20th master sensor image along with its corresponding range map. The 
image is from a sequence of 240 stereo image pairs generated using a computer controlled 
motion table and scaled helicopter dynamics recorded from flight [22, 23]. The range map, 
composed of 1450 tracked features, has been projected onto the master image plane with the 
range coded by intensity. The cumulative execution time to process the first 20 image pairs 
of the sequence was used as the basis of comparison for each of the computer/load balancing 
schemes tested. 
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Cumulative Execution Time (seconds) 



Figure 9: Effect of VPR partitioning scheme and number of PEs on cumulative execution 
time and speedup for distributed-memory. 

8.1 Loosely coupled distributed-memory machine 

A loosely coupled distributed-memory machine may easily be constructed with an Ethernet 
network of UNIX 5 workstations using the Berkeley sockets library. The benefits of such 
an implementation are low cost and ease of configurability. The disadvantage is that the 
resulting multicomputer is very limited by the bandwidth and latency of the local area 
network. 

The feature tracking algorithm was structured so that a single PE (control node) would 
run code that would schedule and distribute VPRs to slave PEs (compute nodes). The 
slave PEs would contain the bulk of the feature tracking code and could each process a set 
of VPRs. Load balancing is directly controlled by the control node; therefore, each load 
balancing scheme discussed may be tested. 

The distributed machine uses a network of nine Sun workstations. Each compute node 
is a SparcStation2 and the control node is a SparcServer 630MP. During the timing test, only 
the feature tracking software and routine low-overhead operating system support software 
were executing on each node. 

Fig. 9(a) and (b) and 11(a) and (b) show graphs of the cumulative execution time and 
the speedup for the distributed-memory machine as the number compute nodes is increased. 
Fig. 9(b) shows the effect that feature distribution has on speedup using a simple uniform 
scheduler. For this figure the master image was divided into eight equal-area vertical VPRs 
and compared with eight equal-area horizontal VPRs. By comparing the speedup graph in 

5 UNIX is a registered trademark of AT&T. 
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Figure 11: Effect of load balancing schemes and number of PEs on cumulative execution 
time and speedup for distributed-memory. 
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Figure 12: Feature distribution for static allocation compared with dynamic allocation using 
eight PEs. 



Figure 13: Computational distribution for static allocation. 
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Fig. 9 with the feature distribution in Fig. 8, one can see that a vertical partitioning will 
more evenly allocate the ATUs among the processors than a horizontal partitioning for this 
particular image sequence. Fig. 10(a) and (b) shows these feature distributions graphically 
for eight compute nodes during the last frame. 

Fig. 11(a) and (b) compare the uniform, static, and dynamic load balancing schemes 
using M = 64 VPRs (much like Fig. 6). We found that choosing M — N 2 when N is small 
(i.e., N < 8), will give fine enough resolution for the static or dynamic scheduler to perform 
an efficient schedule for the tested feature distributions. 

The VPRs of the uniform method are distributed by vertical columns plus fractions of 
a column when necessary. This method schedules 64 /N VPRs to each compute node, where 
N is the number of nodes. When N = 8 the feature distribution for 64 VPRs is identical to 
the vertical VPRs of Fig. 10(a). 

We solved the static scheduling problem stated earlier (i.e., computation of the partitions 
Xj) with a computationally efficient bin packing 6 heuristic, “ First Fit Decreasing ” (FFD). 
The FFD algorithm operates as follows: VPRs are taken in order of nonincreasing r, and 
assigned to the first PE which has enough computational capacity to accommodate it. The 
major computational burden of this method is the initial 0(M log 2 M) sort of estimated 
compute times. The benefits of this algorithm are twofold: first, it is easy to implement 
and modify and, second, it has been shown to have several strong properties of asymptotic 
optimality [24]. 

One slight modification to the theoretical static scheduler was made prior to imple- 
mentation. The estimated compute time Ti in equation (28) is heavily dependent on the 
correlation surface time t c . A computer trace (SparcStation2) of the feature tracker gives 
the following results: t c = 20.25 ms, tj, = 1.08 ms, and tj = 0.68 ms. In light of these 
numbers the static scheduler was implemented such that Bi from equation (28) was the es- 
timate of the computational load of each VPR (i.e., letting = 0 and tj = 0, ignoring any 
reference to true time). VPRs with no actively tracked features are evenly divided among 
all the processors. Fig. 12(a) shows the feature distribution of the static scheduler during 
the 20th frame, while Fig. 13 shows the compute time distribution during the same frame. 
Comparisons of these graphs lead us to believe that load balancing using only the number 
of active features per VPR may be accurate enough to perform the processor allocation. 
The benefit of this modification is also twofold: first, accurate estimates of tctj and tf are 
not necessary as long a s t c (tj + </) and, second, the static scheduler uses only integer 
mathematics which speeds computation. 

The feature distribution for the dynamic scheduler is shown in Fig. 12(b). We can see 
that only the first six nodes are highly utilized. This is because of the low speed of the 
node interconnect (Ethernet). From the cumulative time graphs it is clear that the dynamic 
scheduler outperforms uniform partitioning for more than two nodes. It also outperforms 
the static scheduler for more than three nodes. Fig. 12(a) would seem to indicate that the 
static scheduler should outperform the dynamic scheduler because it does a better job of 
distributing the load. This is not the case, because, as the number of nodes increases, the 
communication time begins to dominate the total processing time. With the nondynamic 

6 Bin Packing is closely related to the Multiprocessor Scheduling Problem [21]. 
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Cumulative Execution Time (eeconda) 



Figure 14: Cumulative execution time and speedup for shared-memory. 

schedulers there will be wasted time after a node finishes processing its VPRs until the 
distribution node retrieves the results. From these graphs it is clear that the Ethernet 
saturates at six nodes for this application regardless of the scheduler. 

The caveat for the dynamic scheduling is the poor scalability of the communication 
overhead needed to support nonblocking I/O between the distribution node and the compute 
nodes. Since the static scheduler performs nearly as well as the dynamic scheduler, we predict 
that the static method will win out as the number of nodes increases. 

8.2 Shared-memory machine 

A Silicon Graphics IRIS 4D/480 was used to implement a multithreaded shared-memory 
version of the feature tracker. The 4D/480 has eight RISC processors in a shared- memory 
configuration. It is typical of modern shared-memory multicomputers in that the processing 
elements are unbalanced with respect to the shared- memory interconnect [25]. This results 
from the fact that memory interconnect speeds have not kept pace with the increased speed 
of modern RISC CPUs. The 4D/480 also comes with a limited multithread support library 
based on the Sequent Computer Systems parallel programming primitives. 

Ideally for dynamic scheduling, each VPR would be assigned a thread, where the number 
of threads is greater than the number of processors. The operating system would then 
perform dynamic thread management. Such a method would be comparable to the dynamic 
scheduler for the distributed-memory machine. The Sequent/SGI primitives allow for thread 
process management of a limited number of threads (default maximum of eight). With so 
few threads available, any load balancing scheme based on a large number of threads is 
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unworkable. Therefore the only scheduling schemes which could easily be implemented were 
those based upon a limited number of threads: uniform and static scheduling. 

The uniform partitioning method was the easiest to implement. The master image was 
partitioned into vertical VPRs, starting with a single VPR and working up to eight VPRs. A 
thread was assigned to each VPR. Since the number of threads was always less than or equal 
to the number of processors, the thread count was equal to the utilized processor count. Fig. 
14 shows graphs for the cumulative execution time and the speedup for uniform partitioning. 

Also in Fig. 14 are results from an implementation of static scheduling on the 4D/480. 
Instead of assigning one VPR to each thread the load balancer generates an index map 
whereby each thread can address the appropriate set of VPRs from shared-memory. The 
overhead of the static scheduler is outweighed by the increased efficiency as can be seen in 
the graphs. 

9 Conclusions 

The parallelization of the multisensor feature-based range-estimation software has proven 
quite successful. The method has shown good speedup with up to eight processors. We 
have shown that the algorithm, even though it is complex and data-driven, can be efficiently 
parallelized into many independent task/data units which may be processed by a distributed- 
memory or a shared-memory parallel computer. The Silicon Graphics 4D/480, using all eight 
processors and the static scheduler, was able to process the 1450 features of the 20th frame 
in 2.59 seconds. Thus to reach ten frames a second with a maximum of 1500 features we 
will need to speed the processing up 26 times. This is a realistic goal which can be achieved 
in the near future by increasing the number of processing elements, their performance, and 
interconnect speeds. 

The most detrimental aspect of our distributed-memory computer was the low band- 
width of the node interconnect. To combat this problem we will consider new systems with 
much faster node interconnect speeds. We are planning to port the algorithm to an Intel 
iWarp 7 systolic mesh computer [26] and possibly the next generation of Silicon Graphics 
Multi processors . 

In this paper we have not focused on parallel feature reassignment or image data acqui- 
sition and distribution. These are important issues in a real-time system and will be topics 
in the next phase of our research. 
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